Structural insights into i-motif DNA structures in sequences from the insulin-linked polymorphic region

The insulin-linked polymorphic region is a variable number of tandem repeats region of DNA in the promoter of the insulin gene that regulates transcription of insulin. This region is known to form the alternative DNA structures, i-motifs and G-quadruplexes. Individuals have different sequence variants of tandem repeats and although previous work investigated the effects of some variants on G-quadruplex formation, there is not a clear picture of the relationship between the sequence diversity, the DNA structures formed, and the functional effects on insulin gene expression. Here we show that different sequence variants of the insulin linked polymorphic region form different DNA structures in vitro. Additionally, reporter genes in cellulo indicate that insulin expression may change depending on which DNA structures form. We report the crystal structure and dynamics of an intramolecular i-motif, which reveal sequences within the loop regions forming additional stabilising interactions that are critical to formation of stable i-motif structures. The outcomes of this work reveal the detail in formation of stable i-motif DNA structures, with potential for rational based drug design for compounds to target i-motif DNA.


1) SUPPLEMENTARY NOTE 1 Supplementary results and discussion for enhanced sampling molecular dynamics simulations
The initial ILPR i-motif crystal structure consists of two near-identical motifs packed into a dimer as a result of crystallisation.Given the interactions between the sequences observed in the crystal structure originate from the flanking sequence, we looked at sequence 4C (TATCCCCACACCCCTATCCCCACACCCCTAT) and also an analogue with one base missing at the 3'-end of the flanking sequence, 4Cdel (TATCCCCACACCCCTATCCCCACACCCCTA).
The structures were prepared using the tleap module of the AmberTools20 package. 1 With regards to force-field parameters, a hybrid model, combining both bsc1 and OL15 modifications were used. 2,3 s with all AMBER force-fields, these are additive and can be combined.Bsc1 is currently the best descriptor of alpha/gamma torsion angles, while OL15 best describes ε/ dihedrals and non-bonded interactions of bases.Further to these forcefield parameters for protonated cytosine were taken from the AMBER force-field library: 'all_prot_nucleic10.lib'.In the case of the protonated structures, two of the four strands were protonated at the N3 position (C4-7 and C11-14), while the parameters mentioned previously apply to the remaining nucleotides.
Sufficient K + was added to the systems to neutralise the negatively charged phosphate backbones, as well as the C + atoms in the case of the protonated systems.Then excess K + and Cl -atoms were added to simulate a 0.15 mol L -1 salt concentration.The systems were solvated using the TIP3P water model in an orthorhombic box, with all solute atoms at a minimum distance of 12 Å from the edge of the solvation box.
Each system was minimised and equilibrated under NVT conditions for 5 ns at 1 atm.The temperature was set to 300K using a timestep of 4 fs, all rigid bonds, a cut-off of 9 Å and particle mesh Ewald summation (PME) switched on for long-range electrostatics.The nucleic acid was initially fixed during equilibration, but these restraints were gradually removed, while the ions and solvent molecules were allowed to move.The heavy atoms of the nucleic acid were constrained by a spring constant set at 1 kcal mol −1 Å −2 .The production simulations were run in the NVT ensemble using a Langevin thermostat with a damping of 1.0 ps−1 and hydrogen mass repartitioning scheme to achieve time steps of 4 fs.The AdaptiveBandit algorithm is implemented here, 4 and after each short run (50 ns) the algorithm is given a decision to select a pre-sampled conformation from which to restart the next short simulation -thus the objective is to avoid redundant sampling (i.e.visiting states which have already been sampled).The decision that is made aims to maximise a reward function.Here the reward function was defined to be proportional to relative phosphate-phosphate distances throughout the structure.The decision made each time aims to maximise the conformational space explored, using phosphate-phosphate distances as a metric.
The i-motif loops represent distinct sampling problems in that the core structure, stabilised by the hemi-protonated cytosine pairs, is largely stable with minor fluctuations.However, the loop regions are highly dynamic, able to sample multiple conformations with varying degrees of stability.The phosphate-phosphate distances from the structure were used as the Adaptive Bandit metric for the first 20 μs of simulation, after this only phosphate-phosphate distances in loop regions (8-10, 15-17, 22-24) were used as metrics.The loop-only sampling systems were run for a total of 30 μs.Thus, each system was run for a total sampling time of at least 50 μs.
As previously mentioned, i-motif loops represent a distinct sampling problem and conventional molecular dynamics simulations are not able to fully explore the conformational space in which it exists.This is overcome by use of adaptive sampling algorithms such as Adaptive Bandit, however the output from these many short simulations cannot be analysed directly therefore we need to integrate these simulations into Markov state models (MSMs).
MSMs allow the integration of multiple simulation trajectories into a single model of the conformational landscape that contains both kinetic and thermodynamic information as well as maintain structural detail.MSMs are built on inter-state transitions; all of these states can be compiled and used to create a single model.
Here we used the PyEMMA package to create our models. 5The general workflow for creating a Markov goes as follows.First, we use Time-lagged Independent Component Analysis (tICA) to reduce the dimensionality of the input feature.This differs from Principal Component Analysis in that it expands this reduced representation into the time domain.The data from tICA is then projected onto the first two (i.e. two slowest) ICs at a selected lag time and then clustered using the k-means algorithm.The number of clusters chosen represents the number of clusters at which the VAMP2 scoring function no longer increases.This allows the numerous conformations to be mapped onto a set of discrete states.This then allows MSMs to be created that describe transitions that occur between any two clusters.The MSM calculates these transitions between states at a given lag time.It should be noted that this lag time is independent from that used in tICA.Choice of MSM lag time is key in that it must be sufficiently long to ensure Markovian dynamics (i.e. the system is memory-less and each transition isn't dependent on the previous state) but short enough to still be able to resolve the dynamics of the system.In order to choose this lag time we choose the smallest possible lag time at which the underlying processes (represented in an implied-timescale plot) show convergence.This model can then be validated using the Chapman-Kolmogorov test, which tests the ability of the model at longer time scales.The original space can then be discretized into the longest-living metastable states according to their eigenvectors.Then, the most stable structure is then extracted from each metastable state, as well as a distribution of 50 structures with an associated trajectory file to display transition between all 50 structures in a particular state.Finally, the PCCA++ algorithm is used to coarse-grain the conformational space of the metastable states and then approximate the stationary distribution of the data as well as approximating the relative free energies.A total of 6 Markov state models were created to describe different aspects of the dynamics of the i-motif structure, details of which are displayed in Table S10.A combination of / angles and the distance between N3 atoms of the stem cytosines were used as features to describe the central stem, while a combination of / angles and χ angles were used as features to describe the dynamics of the loops.
In the ILPR i-motif structure there are three loops, 1 (A8C9A10) and 3 (A22C23A24) are distal to the 5' and 3' ends, while loop 2 (T15A16T17) lies in the middle at the sequence and is flanked in space by both the 5' and 3' prime ends.4Cdel, loop 1 consists of C9 protruding away from the centre of the structure and into the solvent, while A10 intercalates between A8 and C7, interacting with both via -stacking, as well as forming a non-canonical base pair with A22.In loop 3, A22 is pointed towards the centre of the structure and caps C21, while C23 and C24 protrude into the solvent (as a result of crystal packing) with their faces around 6.5 Å apart.
In loop 2, T17 interacts with T3 via two hydrogen bonds formed between (O4 and N3), this pair forms in parallel to the C:C + pair above and stabilises these via - interactions.T17 itself is stabilised via -stacking with T16 below.T15 forms no obvious actions and seems to shield the internal structure of the i-motif from solvent.The 4C structure is near-identical apart from A10, which instead of intercalating between A8 and C7, is flipped outward into the solvent, A8 instead, forms the non-canonical base pair with A22.This is also the case with A16, which protrudes into the solvent instead of stacking below T17, which itself is instead stabilised from below by T29.
In order to assess the individual components that contribute to structural stability within the i-motif, MSMs were created with features extracted from only loop regions (bases 8-10, 15-17 & 22-24) and then with features from stem regions of the structure (bases 4-7, 11-14, 18-21).As expected, the loop regions are far more flexible and the majority of dynamic motion within the structure can be attributed to these regions.This is borne out in the fact that the stem regions required significantly larger tICA (40 ns) and MSM lag times, than the models of the loop systems, to fully describe the conformational landscape.This is also displayed by the findings that the stem only exists in a two-state system, displayed by a double-well potential (Figure S18/19).These two states have an RMSD < 1.2 Å, whereas the conformational landscapes are split into many more metastable states with significantly higher RMSD values.
Although not a perfect comparison, this agrees with similar work on G-quadruplex structures using similar force-fields, as such validating the fact that these are suitable to study i-motif stem structures and C:C + pairs.

Sampling of the crystal structure conformations
The crystalline unit cell consists of two individual strands related by an inverted symmetry.
The two strands (4C and 4Cdel from here onwards) are linked with the 5' end of each unit interacting with loop 3 of the other.However, the interactions in these regions are not identical; A2 of strand 4C is able to form two hydrogen bonds with A24 in strand B due to a slight misalignment this interaction does not occur in the analogous region in the inverted symmetry.However, T1 stabilises A24 via a -stacking interaction in both cases.
The ability of a simulation to fully sample its starting structure is often used as a benchmark to validate the accuracy of the sampled conformation.In our case of the ILPR i-motif, several favourable interactions occur as a result of crystal packing and therefore the conformation found within the crystal structure may not be a realistic depiction of the structure in vivo.
Given that once these interactions are removed, they are likely to be compensated by other interactions.Thus, the crystal structure can no longer be defined as the most stable state and may rarely be visited throughout the course of a simulation.This is particularly relevant in the case of loop 3 where crystal packing effects lead to complete solvent exposure when the structures formed in the crystalline lattice are separated.Still, assessing whether the conformations adopted by the loops, during the simulations, are able to sample their crystal structure conformation is nevertheless a useful benchmark to assess the accuracy and quality of the model created.All the loops are able to sample conformations observed in the crystalline lattice in spite of the loss of crystal packing interactions.For example, A8-A10 (loop 1), A16-T17 (loop 2) and A22-C21 (loop 3) -stacking interactions in strand A and C7-A8 (loop1), T15 -stacking below T3:T17 base pair and C21-A22 (loop3) -stacking interactions in strand B are all observed (Figure S20).

Loop Dynamics
Given the intrinsic rigidity of the i-motif stems enforced by the hydrogen bonding between cytosines and repulsive forces between neighbouring carbonyl and amino groups in the C:C + pair stacks, the regions either side of the stem (loop 1/3 and loop 2) can be viewed as a two different sub-systems as the dynamics in one region is likely to have a minimal effect on that of the other.Therefore, in order to assess loop conformations more directly, MSMs were created for the sequences 4C and 4Cdel in which only loop features were used as inputs for tICA, namely / angles and χ angles which have proven useful previously to assess loop dynamics in G-quadruplexes.

Loop 2
In 4C, the most stable conformation (state 4) represents 86% of all sampled conformations.This conformation samples the interactions observed in the crystal structure including the T3:T17 base pair.Stacked below the T3:T17 base pair is the T5:A2:T29 triad.In the triad, T15 is slightly off plane.Stacking directly below A2:T29 is T1:A30 base pair.The 3' end terminal T33 stacks below A30.A16 from loop 2 is solvent exposed and interacts with the 3' hydroxyl group of T33 (Figure S21).These series of base stacks between the flanking termini and loop 2 help to stabilise the structure.In 4Cdel, which is missing the terminal T in the sequence, loop 2 dynamics can be seen as the interconversion between four primary states as illustrated in the free energy landscape (Figure S22).
The most stable conformation (state 5), representing 39% of the equilibrium distribution system, perfectly samples the conformation displayed in the crystal structures; namely T15 tilted inwards, T3 in plane with T17 to form a non-canonical T:T base pair, with two hydrogen bonds formed between O4 and N3 atoms that stabilises the C:C+ pair above, as well A16 stacking below it.Interestingly T2 from the 5' end makes a canonical base pair with A16, to add further stability to the T3:T17 base pair.Finally, and as expected when crystal packing effects are removed T29 and A30 fold inwards and stack together (Figure S23).
When comparing the dynamics of loop 2 between sequences 4C and 4Cdel, it becomes evident that 4C is significantly more skewed towards a single structure with metastable state 4 representing 86% of the equilibrium population.This skewing might be a result of the longer 3' end which leads to enhanced -stacking and hydrogen bond interactions with loop 2, and thus leading to greater stability.The elongated 3' end also presents a more plausible version of what may happen in vivo as there are likely to be at least three residues between the start/end of the i-motif and the start of other canonical/non-canonical nucleic acid structures.
It is also worth noting that the position and orientation of T15 is unchanged across the metastable structures, highlighting to its importance in maintaining structural integrity within this region.Given its position and orientation and the interactions it makes with A2, it is likely that A15 acts like a hydrophobic cap that impedes the entrance of water into the core of the structure and therefore likely interference with the C•C + pair.Finally, our results validate the importance of non-canonical base pairs acting as capping residues in i-motif structures.All structures contain a T:T base pair between T3 and T17 that is analogous to that of the C:C + pair and thus extends the loop and further stabilised the terminal pair of the stem through favourable stacking of exocyclic carbonyls and amino groups in an antiparallel manner.

Loops 1 and 3
To assess the dynamics of loops of 1 and 3 only, without the models being affected by the dynamics around loop 2, MSMs were then created using only the / angles and χ angles of loop 1 and 3 only as they act as a singular interacting system and their dynamics should be taken together into account.The conformational space of loops 1 and 3 in strand A can be described as the interconversion between 6 metastable states (Figure S24).The most populated state is state 6 accounting for 52% of all sampled conformations.In loop 1, A10 is able to stack on C7 to form a non-canonical base pair interaction with A22, with hydrogen bonds formed between the N6 and N7 atoms in both bases.Interestingly, in this cluster we observe stacking of C9 above A22 and A8 above A10.Metastable state 6 recapitulates the conformations and interaction that are observed in the crystal structure (Figure S20).State 3 is a substate of state 6, where C9 stacks above A10:A22 base pair.In metastable state 5 (38% population), the A10:A22 base pair is formed but there is no stacking observed by A8 or C9 bases above the A10:A22 base pair.State 4 is a substate of metastable state 5, where the A10:A22 base pair is not formed, while A8 makes -stacking with A10.State 1 is a low populated, high energy conformation where loop 1 and 3 are collapsed on the terminal C7:C21 base pair.The A10:A22 base pair that stacks on top of C7:C21 base pair is off centred, while C9 stacks on top of A10:A22 base pair instead of A8, which is tucked in a groove and is shielded from the solvent by A24.Nucleotide A23 is solvent exposed.State 2 population represents the open conformation of the loop 1 and 3.In this state, the loops move outward exposing the A10:A22 base pair and in some instances the C7:C21 base pairs.Loop 1 and 3 in 4C are separated by tICA into 4 metastable states (Figure S25).State 1 in 4C represents the crystalline conformation where A8:A22 base pair is present, while bases C9, A10, C23 and A24 are solvent exposed.In state 2, the A8 and A22 base pair is lost and the sugar phosphate backbone of loops 1 and 3 is in a constrained conformation pointing towards the central i-motif stem.Unlike 4Cdel, loop 1 in the most populated state 3 (91%) of 4C does not fully sample the conformation found in the crystal structure.The A8:A22 base pair is abolished and A22 is instead stabilised by forming two hydrogen bonds with its N6 amine, making interactions with the O4' and N3 atoms in A10.A8 is stabilised through the contact between its N6 amine and the phosphate oxygen in A24.A23 and A24 stack next to each other in a vertical orientation, this stacking arrangement is further stabilised by the interaction between the A24 N6 amine and the C23 phosphate group.Furthermore, the slight inward rotation of loop 3 allows the stabilisation of the C7 N4 amine via interaction with the A24 phosphate.State 4 is similar to state 3.The only difference is the arrangement of C9, A10, C23 and A24 bases in loops 1 and 3 respectively.To summarise, upon inspection of structures extracted from coarsegrained models, 4Cdel featured far more unstructured conformations in loops 1 and 3 while those in 4C, with the full flanking sequence, are well ordered.

2) SUPPLEMENTARY NOTE 2 Sanger sequencing of sub-cloned INS421 plasmid with different ILPR inserts
The FASTA format of the sequences is shown after processing of Sanger sequencing as following; lowercase letters are the original pGL410_INS421 firefly expression plasmid with uppercase letters representing the human insulin promotor (a gift from Kevin Ferreri via Addgene; plasmid #49057; http://n2t.net/addgene:49057;RRID:Addgene_49057) 6 underlined and bold letters represent identifying restriction digest site for ILPR variant confirmation; bold and uppercase letters are showing the selected ILPR variants for reporter gene assay.

FSupplementary Figure 4 .FSupplementary Figure 24 :
transitional pH (pHT) from the inflection point of the Boltzmann sigmoidal.Source data for this figure are provided as a Source Data file.Supplementary Figure 3. A-C) CD spectroscopy of C-rich ILPR sequences 6C, 7C, 8C. 10 µM DNA in 10 mM NaCaco 100 mM KCl and pH as indicated.D-F) Corresponding plot ellipticity of the experimental repeats (n=3, mean ± SD) at 288 nm at the different pHs to determine transitional pH (pHT) from the inflection point of the Boltzmann sigmoidal.Source data for this figure are provided as a Source Data file.D E A-C) CD spectroscopy of C-rich ILPR sequences 9C, 10C, 11C. 10 µM DNA in 10 mM NaCaco 100 mM KCl and pH as indicated.D-F) Corresponding plot ellipticity of the experimental repeats (n=3, mean ± SD) at 288 nm at the different pHs to determine transitional pH (pHT) from the inflection point of the Boltzmann sigmoidal.Source data for this figure are provided as a Source Data file.D E Loop2 interactions with the 5'/3' flanking nucleotides in (A) 4Cdel and (B) 4C.In 4Cdel, one nucleotide is less in the 3' flank, in addition to the T13:T17 base pair, A16 makes hydrogen bond with T1, T15 with A2.T1 from the 5' flank is tilted inwards forming a 3-base π-stack with T29 and T30 from the 3' flanking nucleotides.In 4C, an extra nucleotide in the 3' flank results in enhanced hydrogen bonding and π-stacking interactions leading to a more structured capping of the end of the 4C i-motif.The side and top views if the interactions have been illustrated of Loop 2 (pink) and flanking nucleotides (cyan).

Table 3 .
Data collection and refinement statistics for 4C.Values in parentheses are for the highest-resolution shell.

Table 4 .
Data collection and processing statistics for 4C-Br.Values in parentheses are for the highest-resolution shell.

Table 5 .
Narrow and wide groove dimensions based on P -P distances of adjacent strands.Average dimensions for each groove are shown in the schematic in Supplementary Figure12with P -P distances of nucleotides from the cytosine rich core, loop or flank regions included in the calculation.

1 -Core Loop 1 -Loop 3 SupplementaryTable 7 .
Intermolecular interactions between strands A and B of the 4C imotif based on the crystal structure.